IN THE CLAIMS:

Please substitute the following claims for the same-numbered claims in the application:

1. (Currently Amended) A method for determining a degree of similarity between
documents in a given document collection, the method comprising: 
modeling all said documents as labeled tree representations; 

building a computerized dictionary of path representations relating to paths that occur in
said documents, wherein said path representations comprise a path in the tree representation of a
document having a capability to include positional information of preceding sibling nodes that
have a same label as a given node in a tree;

storing, for at least two said documents, said labeled tree representations of respective
documents;

storing, for said at least two said documents, said path representations relating to said 
paths that occur in said documents from root nodes to leaf nodes in said labeled tree 
representations of said respective documents; 

representing each of said documents in said document collection as an N-dimensional
vector comprising an element i denoting a value of a feature associated with a particular path,
wherein said feature comprises any of a presence or absence of said particular path in said 
documents and a frequency of occurrence of said particular path in said documents; 

calculating a measure of similarity between two of the documents based upon the 
frequency of occurrence of similar paths specified by the path representations; and 

using said measure of similarity to cluster a plurality of documents comprising similar 
information, wherein said documents comprise any of web page documents and extensible
Markup Language (XML) documents, 

wherein two documents that differ only in the frequency of occurrence of the paths 
associated with said two documents are considered to be more similar to each other than two 
documents that differ in the occurrence of paths. 



2. (Previously Presented) The method as claimed in claim 1, wherein the tree representation
is a Document Object Model representation. 



3. (Original) The method as claimed in claim 1, further comprising the step of generating a
path representation for a path of a document as a sequence of labels representative from a root
node to a leaf node in the labeled tree representation of the document.



4. (Original) The method as claimed in claim 1, further comprising the step of storing, as
path representations, sets of sequenced labels representative of distinct paths in a labeled tree 
representation of a corresponding document. 



5. (Original) The method as claimed in claim 4, further comprising the step of storing a path
dictionary ($Dict_{paths} = \{p_1, p_2, \ldots, p_N\}$) of distinct paths collated from a tree representation for a
document.



6. (Original) The method as claimed in claim 5, further comprising the step of eliminating
selected paths from the path dictionary ($Dict_{paths}$).

7. (Original) The method as claimed in claim 6, wherein paths that occur highly frequently
or highly infrequently are eliminated from the path dictionary ($Dict_{paths}$).

8. (Original) The method as claimed in claim 7, further comprising the step of computing
the frequency of occurrence ($f_j(p_i)$) of a path ($p_i$) in a document ($d_j$).

9. (Original) The method as claimed in claim 8, further comprising the step of computing
the maximum number of instances ($f_{max} = \max_i f_j(p_i)$) in which a path ($p_i$) in the document ($d_j$)
occurs.

10. (Original) The method as claimed in claim 9, further comprising the step of storing a
representation of the document ($d_j$) as an N-dimensional vector ($[d_{j1}, d_{j2}, \ldots, d_{jN}]$, where $d_{jk} =
f_j(p_k)/f_{max}$, $1 \le k \le N$) of relative frequencies of occurrence ($f_j(p_k)$) of paths ($p_k$) in the document
($d_j$).
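
By way of non-limiting illustration only, the relative-frequency vector recited in claim 10 might be sketched in Python as follows. The function name `relative_frequency_vector`, the slash-delimited path strings, and the choice to return an all-zero vector when no dictionary path occurs are assumptions made for this sketch and do not appear in the claims.

```python
from collections import Counter

def relative_frequency_vector(doc_paths, dictionary):
    """Return [d_j1, ..., d_jN] with d_jk = f_j(p_k) / f_max.

    doc_paths  -- list of path strings occurring in one document (with repeats)
    dictionary -- ordered list of the N distinct paths p_1 ... p_N
    """
    counts = Counter(doc_paths)                 # f_j(p) for every path seen
    freqs = [counts[p] for p in dictionary]     # f_j(p_k); 0 if p_k is absent
    f_max = max(freqs, default=0)               # f_max = max_k f_j(p_k)
    if f_max == 0:
        # Assumption: a document sharing no path with the dictionary
        # is represented by the zero vector.
        return [0.0] * len(dictionary)
    return [f / f_max for f in freqs]

# Hypothetical example data.
dictionary = ["html/body/p", "html/body/div/p", "html/head/title"]
doc = ["html/body/p"] * 4 + ["html/head/title"]
print(relative_frequency_vector(doc, dictionary))  # [1.0, 0.0, 0.25]
```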

11. (Original) The method as claimed in claim 8, further comprising the step of computing
the minimum number of instances ($f_{min} = \min_i f_j(p_i)$) in which a path ($p_i$) in the document ($d_j$)
occurs.

12. (Original) The method as claimed in claim 10, further comprising the step of computing
the similarity between a pair of documents ($d_i$, $d_j$) as a function ($sim(d_i, d_j)$) of metrics relating
the number of paths common to the respective documents ($d_i$, $d_j$).

13. (Original) The method as claimed in claim 12, wherein the function for computing the
similarity between a pair of documents ($d_i$, $d_j$)

$$sim(d_i, d_j) = \frac{\sum_{k=1}^{N} \min(d_{ik}, d_{jk})}{\sum_{k=1}^{N} \max(d_{ik}, d_{jk})}$$

is the quotient of a numerator, defined as the sum for all paths ($k = 1 \ldots N$) of the
minimum number of instances ($\min(d_{ik}, d_{jk})$) in which paths occur in the respective documents
($d_i$, $d_j$), and a denominator, defined as the sum for all paths ($k = 1 \ldots N$) of the maximum number
of instances ($\max(d_{ik}, d_{jk})$) in which paths occur in the respective documents ($d_i$, $d_j$).
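
By way of non-limiting illustration only, the quotient recited in claim 13 might be computed as in the following Python sketch. The function name `similarity` and the convention of returning 0.0 when the denominator is zero are assumptions of this sketch; the claims do not address that case.

```python
def similarity(d_i, d_j):
    """Sum of element-wise minima divided by sum of element-wise maxima.

    d_i, d_j -- equal-length vectors of non-negative path frequencies
    """
    if len(d_i) != len(d_j):
        raise ValueError("vectors must have the same dimension N")
    numerator = sum(min(a, b) for a, b in zip(d_i, d_j))
    denominator = sum(max(a, b) for a, b in zip(d_i, d_j))
    # Assumption: two all-zero vectors are given similarity 0.0.
    return numerator / denominator if denominator else 0.0

# Hypothetical vectors over a two-path dictionary.
base = [1.0, 1.0]
differs_in_frequency = [0.5, 1.0]    # same paths, different counts
differs_in_occurrence = [0.0, 1.0]   # one path is missing altogether
print(similarity(base, differs_in_frequency))   # 0.75
print(similarity(base, differs_in_occurrence))  # 0.5
```

In this example the pair that differs only in frequency scores higher than the pair that differs in occurrence, which matches the ordering stated in the final wherein clause of claim 1.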

14. (Original) The method as claimed in claim 1, wherein the tree representation of a
document includes a positional index, which represents, for a node (n), the number of previous
sibling nodes with the same label as that of node (n).
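
By way of non-limiting illustration only, the positional index of claim 14 might be derived while enumerating root-to-leaf paths, as in the following Python sketch. The function name `root_to_leaf_paths`, the `/` delimiter, and the `label[index]` notation are assumptions chosen for this sketch; the sketch uses the standard-library `xml.etree.ElementTree` parser rather than any particular Document Object Model implementation.

```python
import xml.etree.ElementTree as ET

def root_to_leaf_paths(xml_text):
    """List root-to-leaf paths; each label carries a positional index equal to
    the number of preceding siblings that have the same label."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        children = list(node)
        if not children:              # leaf node: the path is complete
            paths.append(prefix)
            return
        seen = {}                     # label -> same-label siblings seen so far
        for child in children:
            index = seen.get(child.tag, 0)
            seen[child.tag] = index + 1
            walk(child, f"{prefix}/{child.tag}[{index}]")

    walk(root, f"{root.tag}[0]")      # the root has no preceding siblings
    return paths

# Hypothetical input document.
print(root_to_leaf_paths("<a><b/><b><c/></b><d/></a>"))
# ['a[0]/b[0]', 'a[0]/b[1]/c[0]', 'a[0]/d[0]']
```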

15. (Original) The method as claimed in claim 14, further comprising the step of storing as a 
path representation a set that defines positional information of sibling nodes under a parent node.

16. (Original) The method as claimed in claim 15, further comprising the step of storing
precise path representations that precisely define a document structure, and generalised path 
representations that partially generalise structural aspects of precise path representations of a 
document. 

17. (Original) The method as claimed in claim 16, wherein the step of calculating the
measure of similarity involves determining a total number of precise path representations of one 
document that are either shared by the other document, or are a subsumed subset of at least one 
of the generalised path representations of the other document. 

18. (Original) The method as claimed in claim 17, further comprising the step of normalising 
the measure of similarity by a term that represents the number of unique path representations 
shared by the two documents. 

19. (Original) The method as claimed in claim 18, wherein the number of unique path
representations is calculated by adding the number of path representations for each document,
and subtracting from this total the number of path representations shared by the two documents.

20. (Original) The method as claimed in claim 14, further comprising the step of storing as a
path representation a sequence of terms separated by a delimiting symbol, in which each term is 
represented by a label and a parenthesised predicate that specifies the positional index of the 
term either specifically or generally.

21-22. (Cancelled).

23. (Currently Amended) A program storage device readable by computer, tangibly
embodying a program of instructions executable by said computer to perform a method for
determining a degree of similarity between documents in a given document collection, the
method comprising:

modeling all said documents as labeled tree representations; 

building a computerized dictionary of path representations relating to paths that occur in
said documents, wherein said path representations comprise a path in the tree representation of a
document having a capability to include positional information of preceding sibling nodes that
have a same label as a given node in a tree;

storing, for at least two said documents, said labeled tree representations of respective
documents;

storing, for said at least two said documents, said path representations relating to said 
paths that occur in said documents from root nodes to leaf nodes in said labeled tree 
representations of said respective documents; 

representing each of said documents in said document collection as an N-dimensional
vector comprising an element i denoting a value of a feature associated with a particular path,
wherein said feature comprises any of a presence or absence of said particular path in said 
documents and a frequency of occurrence of said particular path in said documents; 

calculating a measure of similarity between two of the documents based upon the 
frequency of occurrence of similar paths specified by the path representations[[.]]; and

using said measure of similarity to cluster a plurality of documents comprising similar
information, wherein said documents comprise any of web page documents and extensible
Markup Language (XML) documents,

wherein two documents that differ only in the frequency of occurrence of the paths
associated with said two documents are considered to be more similar to each other than two
documents that differ in the occurrence of paths.

24. (Previously Presented) The program storage device in claim 23, wherein said method 
further comprises the tree representation is a Document Object Model representation. 

25. (Previously Presented) The program storage device in claim 23, wherein said method 
further comprises the step of generating a path representation for a path of a document as a 
sequence of labels representative from a root node to a leaf node in the labeled tree 
representation of the document. 

26. (Previously Presented) The program storage device in claim 23, wherein said method 
further comprises the step of storing, as path representations, sets of sequenced labels 
representative of distinct paths in a labeled tree representation of a corresponding document. 

27. (Previously Presented) The program storage device in claim 23, wherein the tree 
representation of a document includes a positional index, which represents, for a node (n), the 
number of previous sibling nodes with the same label as that of node (n).

28. (Currently Amended) A computer system operable for determining a degree of similarity 
between documents in a given document collection, the computer system comprising: 

means for modeling all said documents as labeled tree representations; 

means for building a computerized dictionary of path representations relating to paths 
that occur in said documents, wherein said path representations comprise a path in the tree
representation of a document having a capability to include positional information of preceding
sibling nodes that have a same label as a given node in a tree;

means for storing, for at least two said documents, said labeled tree representations of 
respective documents; 

means for storing, for said at least two said documents, said path representations relating 
to said paths that occur in said documents from root nodes to leaf nodes in said labeled tree 
representations of said respective documents; 

means for representing each of said documents in said document collection as an
N-dimensional vector comprising an element i denoting a value of a feature associated with a
particular path, wherein said feature comprises any of a presence or absence of said particular 
path in said documents and a frequency of occurrence of said particular path in said documents; 

means for calculating a measure of similarity between two of the documents based upon 
the frequency of occurrence of similar paths specified by the path representations; and 

means for using said measure of similarity to cluster a plurality of documents comprising 
similar information, wherein said documents comprise any of web page documents and 
extensible Markup Language (XML) documents,

wherein two documents that differ only in the frequency of occurrence of the paths 
associated with said two documents are considered to be more similar to each other than two 
documents that differ in the occurrence of paths. 

29. (Previously Presented) The computer system device in claim 28, wherein said 
representations is a Document Object Model representation. 

30. (Previously Presented) The computer system device in claim 28, further comprising 
means for generating a path representation for a path of a document as a sequence of labels 
representative from a root node to a leaf node in the labeled tree representation of the document.

31. (Previously Presented) The computer system device in claim 28, further comprising
means for storing, as path representations, sets of sequenced labels representative of distinct 
paths in a labeled tree representation of a corresponding document. 

32. (Previously Presented) The computer system device in claim 28, wherein the tree 
representation of a document includes a positional index, which represents, for a node (n), the
number of previous sibling nodes with the same label as that of node (n).